Relational Data

R4DS 09 - Relational Data with dplyr

lruolin
05-17-2021

R4DS Practice 10: dplyr

The codes below are from the practice exercises in https://r4ds.had.co.nz/, and are taken with reference from: https://jrnold.github.io/r4ds-exercise-solutions/

Let’s begin now

Loading tidyverse package.

Relational data

Typically, there are many tables of data, and you must combine them to answer the questions you are interested in. Collectively, mutiple tables of data are called relational data, becase you are interested in the relations and not just the individual datasets.

nycflights13

airlines: look up the full carrier name from its abbreviated code

airlines
# A tibble: 16 x 2
   carrier name                       
   <chr>   <chr>                      
 1 9E      Endeavor Air Inc.          
 2 AA      American Airlines Inc.     
 3 AS      Alaska Airlines Inc.       
 4 B6      JetBlue Airways            
 5 DL      Delta Air Lines Inc.       
 6 EV      ExpressJet Airlines Inc.   
 7 F9      Frontier Airlines Inc.     
 8 FL      AirTran Airways Corporation
 9 HA      Hawaiian Airlines Inc.     
10 MQ      Envoy Air                  
11 OO      SkyWest Airlines Inc.      
12 UA      United Air Lines Inc.      
13 US      US Airways Inc.            
14 VX      Virgin America             
15 WN      Southwest Airlines Co.     
16 YV      Mesa Airlines Inc.         

airports: gives information about each airport, identified by the faa airport code

airports
# A tibble: 1,458 x 8
   faa   name                 lat    lon   alt    tz dst   tzone      
   <chr> <chr>              <dbl>  <dbl> <dbl> <dbl> <chr> <chr>      
 1 04G   Lansdowne Airport   41.1  -80.6  1044    -5 A     America/Ne…
 2 06A   Moton Field Munic…  32.5  -85.7   264    -6 A     America/Ch…
 3 06C   Schaumburg Region…  42.0  -88.1   801    -6 A     America/Ch…
 4 06N   Randall Airport     41.4  -74.4   523    -5 A     America/Ne…
 5 09J   Jekyll Island Air…  31.1  -81.4    11    -5 A     America/Ne…
 6 0A9   Elizabethton Muni…  36.4  -82.2  1593    -5 A     America/Ne…
 7 0G6   Williams County A…  41.5  -84.5   730    -5 A     America/Ne…
 8 0G7   Finger Lakes Regi…  42.9  -76.8   492    -5 A     America/Ne…
 9 0P2   Shoestring Aviati…  39.8  -76.6  1000    -5 U     America/Ne…
10 0S9   Jefferson County …  48.1 -123.    108    -8 A     America/Lo…
# … with 1,448 more rows

planes: gives information about each plane, identified by its tail number

planes
# A tibble: 3,322 x 9
   tailnum  year type    manufacturer model engines seats speed engine
   <chr>   <int> <chr>   <chr>        <chr>   <int> <int> <int> <chr> 
 1 N10156   2004 Fixed … EMBRAER      EMB-…       2    55    NA Turbo…
 2 N102UW   1998 Fixed … AIRBUS INDU… A320…       2   182    NA Turbo…
 3 N103US   1999 Fixed … AIRBUS INDU… A320…       2   182    NA Turbo…
 4 N104UW   1999 Fixed … AIRBUS INDU… A320…       2   182    NA Turbo…
 5 N10575   2002 Fixed … EMBRAER      EMB-…       2    55    NA Turbo…
 6 N105UW   1999 Fixed … AIRBUS INDU… A320…       2   182    NA Turbo…
 7 N107US   1999 Fixed … AIRBUS INDU… A320…       2   182    NA Turbo…
 8 N108UW   1999 Fixed … AIRBUS INDU… A320…       2   182    NA Turbo…
 9 N109UW   1999 Fixed … AIRBUS INDU… A320…       2   182    NA Turbo…
10 N110UW   1999 Fixed … AIRBUS INDU… A320…       2   182    NA Turbo…
# … with 3,312 more rows

weather: gives the weather at each NYC airport for each hour

weather
# A tibble: 26,115 x 15
   origin  year month   day  hour  temp  dewp humid wind_dir
   <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>
 1 EWR     2013     1     1     1  39.0  26.1  59.4      270
 2 EWR     2013     1     1     2  39.0  27.0  61.6      250
 3 EWR     2013     1     1     3  39.0  28.0  64.4      240
 4 EWR     2013     1     1     4  39.9  28.0  62.2      250
 5 EWR     2013     1     1     5  39.0  28.0  64.4      260
 6 EWR     2013     1     1     6  37.9  28.0  67.2      240
 7 EWR     2013     1     1     7  39.0  28.0  64.4      240
 8 EWR     2013     1     1     8  39.9  28.0  62.2      250
 9 EWR     2013     1     1     9  39.9  28.0  62.2      260
10 EWR     2013     1     1    10  41    28.0  59.6      260
# … with 26,105 more rows, and 6 more variables: wind_speed <dbl>,
#   wind_gust <dbl>, precip <dbl>, pressure <dbl>, visib <dbl>,
#   time_hour <dttm>

Imagine that you want to draw the route each plane flies from its origin to its destination. What variables would you need? What tables would you need to combine?

# require the latitude and longitude of the origin and destination airports of each flight. 

glimpse(flights)
Rows: 336,776
Columns: 19
$ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 201…
$ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, …
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, …
$ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, …
$ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838,…
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846,…
$ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2,…
$ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV…
$ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, …
$ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668…
$ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EW…
$ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FL…
$ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 1…
$ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, …
$ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, …
$ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0…
$ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 20…
glimpse(airports) # has lat, lon
Rows: 1,458
Columns: 8
$ faa   <chr> "04G", "06A", "06C", "06N", "09J", "0A9", "0G6", "0G7"…
$ name  <chr> "Lansdowne Airport", "Moton Field Municipal Airport", …
$ lat   <dbl> 41.13047, 32.46057, 41.98934, 41.43191, 31.07447, 36.3…
$ lon   <dbl> -80.61958, -85.68003, -88.10124, -74.39156, -81.42778,…
$ alt   <dbl> 1044, 264, 801, 523, 11, 1593, 730, 492, 1000, 108, 40…
$ tz    <dbl> -5, -6, -6, -5, -5, -5, -5, -5, -5, -8, -5, -6, -5, -5…
$ dst   <chr> "A", "A", "A", "A", "A", "A", "A", "A", "U", "A", "A",…
$ tzone <chr> "America/New_York", "America/Chicago", "America/Chicag…
flights_latlon <- flights %>% 
  inner_join(select(airports, origin = faa,
                    origin_lat = lat,
                    origin_lon = lon),
             by = "origin") %>% 
  inner_join(select(airports, dest = faa,
                    dest_lat = lat,
                    dest_lon = lon),
             by = "dest") 

glimpse(flights_latlon)
Rows: 329,174
Columns: 23
$ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 201…
$ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ dep_time       <int> 517, 533, 542, 554, 554, 555, 557, 557, 558, …
$ sched_dep_time <int> 515, 529, 540, 600, 558, 600, 600, 600, 600, …
$ dep_delay      <dbl> 2, 4, 2, -6, -4, -5, -3, -3, -2, -2, -2, -2, …
$ arr_time       <int> 830, 850, 923, 812, 740, 913, 709, 838, 753, …
$ sched_arr_time <int> 819, 830, 850, 837, 728, 854, 723, 846, 745, …
$ arr_delay      <dbl> 11, 20, 33, -25, 12, 19, -14, -8, 8, -2, -3, …
$ carrier        <chr> "UA", "UA", "AA", "DL", "UA", "B6", "EV", "B6…
$ flight         <int> 1545, 1714, 1141, 461, 1696, 507, 5708, 79, 3…
$ tailnum        <chr> "N14228", "N24211", "N619AA", "N668DN", "N394…
$ origin         <chr> "EWR", "LGA", "JFK", "LGA", "EWR", "EWR", "LG…
$ dest           <chr> "IAH", "IAH", "MIA", "ATL", "ORD", "FLL", "IA…
$ air_time       <dbl> 227, 227, 160, 116, 150, 158, 53, 140, 138, 1…
$ distance       <dbl> 1400, 1416, 1089, 762, 719, 1065, 229, 944, 7…
$ hour           <dbl> 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, …
$ minute         <dbl> 15, 29, 40, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 20…
$ origin_lat     <dbl> 40.69250, 40.77725, 40.63975, 40.77725, 40.69…
$ origin_lon     <dbl> -74.16867, -73.87261, -73.77893, -73.87261, -…
$ dest_lat       <dbl> 29.98443, 29.98443, 25.79325, 33.63672, 41.97…
$ dest_lon       <dbl> -95.34144, -95.34144, -80.29056, -84.42807, -…
# first 100 flights

flights_latlon %>% 
  slice(1:100) %>% 
  ggplot(aes(
    x = origin_lon, xend = dest_lon,
    y = origin_lat, yend = dest_lat
  )) +
  borders("state") +
  geom_segment(arrow = arrow(length = unit(0.1, "cm"))) +
  coord_quickmap() +
  labs( y = "Lat",
        x = "Lon")

Keys

Add a surrogate key to flights

flights
# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     1      517            515         2      830
 2  2013     1     1      533            529         4      850
 3  2013     1     1      542            540         2      923
 4  2013     1     1      544            545        -1     1004
 5  2013     1     1      554            600        -6      812
 6  2013     1     1      554            558        -4      740
 7  2013     1     1      555            600        -5      913
 8  2013     1     1      557            600        -3      709
 9  2013     1     1      557            600        -3      838
10  2013     1     1      558            600        -2      753
# … with 336,766 more rows, and 12 more variables:
#   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>
flights %>% 
  mutate(flight_id = row_number()) %>% 
  glimpse()
Rows: 336,776
Columns: 20
$ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 201…
$ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, …
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, …
$ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, …
$ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838,…
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846,…
$ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2,…
$ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV…
$ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, …
$ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668…
$ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EW…
$ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FL…
$ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 1…
$ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, …
$ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, …
$ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0…
$ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 20…
$ flight_id      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14…

Mutating Joins

Compute the average delay by destination, then join on the airports data frame so that you can show the spatial distribution of delays.

avg_dest_delays <- flights %>% 
  group_by(dest) %>% 
  summarise(delay = mean(arr_delay, na.rm = T)) %>% 
  inner_join(airports, by = c(dest = "faa")) %>% 
  ggplot(aes(lon, lat, col = delay)) +
  borders("state") +
  geom_point() +
  coord_quickmap() +
  scale_color_viridis_c() +
  theme_classic()

avg_dest_delays 

Add the location of the origin and destination to flights

airport_locations <- airports %>% 
  select(faa, lat, lon)

flights %>% 
  select(year:day, hour, origin, dest) %>% 
  left_join(airport_locations, by = c("origin" = "faa")) %>% 
  left_join(airport_locations, by = c("dest" = "faa"),
            suffix = c("_origin", "_dest"))
# A tibble: 336,776 x 10
    year month   day  hour origin dest  lat_origin lon_origin lat_dest
   <int> <int> <int> <dbl> <chr>  <chr>      <dbl>      <dbl>    <dbl>
 1  2013     1     1     5 EWR    IAH         40.7      -74.2     30.0
 2  2013     1     1     5 LGA    IAH         40.8      -73.9     30.0
 3  2013     1     1     5 JFK    MIA         40.6      -73.8     25.8
 4  2013     1     1     5 JFK    BQN         40.6      -73.8     NA  
 5  2013     1     1     6 LGA    ATL         40.8      -73.9     33.6
 6  2013     1     1     5 EWR    ORD         40.7      -74.2     42.0
 7  2013     1     1     6 EWR    FLL         40.7      -74.2     26.1
 8  2013     1     1     6 LGA    IAD         40.8      -73.9     38.9
 9  2013     1     1     6 JFK    MCO         40.6      -73.8     28.4
10  2013     1     1     6 LGA    ORD         40.8      -73.9     42.0
# … with 336,766 more rows, and 1 more variable: lon_dest <dbl>

Is there a relationship between the age of a plane and its delays?

# merge flights with planes (which contains plane years)
# calculate the average departure delay for each age of flight

plane_cohorts_dep <- inner_join(flights,
                            select(planes, tailnum, plane_year = year),
                            by = "tailnum") %>%
  mutate(age = year - plane_year) %>% 
  filter(!is.na(age)) %>% 
  mutate(age = if_else(age>25, 25L, age)) %>% 
  group_by(age) %>% 
  summarise(dep_delay_mean = mean(dep_delay, na.rm = T),
            dep_delay_sd = sd(dep_delay, na.rm = T),
            n_dep_delay = sum(!is.na(dep_delay))) %>% 
  ggplot(aes(x = age, y = dep_delay_mean)) +
  geom_point() +
  scale_x_continuous("Age of plane(years)", breaks = seq(0,30, by = 10)) +
  scale_y_continuous("Mean Dep Delays (min)") +
  labs(subtitle = "Departure delay increases with age of plane until 10 years, then it declines and flattens out.",
       title = "Relationship between Departure Delay and Age of Plane") +
  theme_classic()



plane_cohorts_arr <- inner_join(flights,
                            select(planes, tailnum, plane_year = year),
                            by = "tailnum") %>%
  mutate(age = year - plane_year) %>% 
  filter(!is.na(age)) %>% 
  mutate(age = if_else(age>25, 25L, age)) %>% 
  group_by(age) %>% 
  summarise(
            arr_delay_mean = mean(arr_delay, na.rm = T),
            arr_delay_sd = sd(arr_delay, na.rm = T),
            n_arr_delay = sum(!is.na(arr_delay))) %>% 
  ggplot(aes(x = age, y = arr_delay_mean)) +
  geom_point() +
  scale_x_continuous("Age of plane(years)", breaks = seq(0,30, by = 10)) +
  scale_y_continuous("Mean Arr Delays (min)") +
  labs(subtitle = "Arr delay increases with age of plane until 10 years, then it declines and flattens out.",
       title = "Relationship between Arr Delay and Age of Plane") +
  theme_classic()

gridExtra::grid.arrange(plane_cohorts_dep, plane_cohorts_arr, nrow = 2) 

What weather conditions make it more likely to see a delay?

flight_weather <- flights %>% 
  inner_join(weather, by = c(
    "origin" = "origin",
    "year" = "year",
    "month" = "month",
    "day" = "day", 
    "hour" = "hour"
  ))

flight_weather %>% 
  mutate(visib_cat = cut_interval(visib, n = 10)) %>% 
  group_by(visib_cat) %>% 
  summarise(dep_delay = mean(dep_delay, na.rm = T)) %>% 
  ggplot(aes(x = visib_cat, y = dep_delay)) +
  geom_point() +
  labs(title = "Relationship beween visibility and delay times",
       subtitle = "A decrease in visibility increases delay timings") +
  theme_classic()

Filtering Joins

Filter flights to only show flights with planes that have flown at least 100 flights.

flights_gte100 <- flights %>% 
  filter(!is.na(tailnum)) %>% 
  group_by(tailnum) %>% 
  count() %>% 
  filter(n >=100)

flights %>% 
  semi_join(flights_gte100, by = "tailnum")
# A tibble: 228,390 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     1      517            515         2      830
 2  2013     1     1      533            529         4      850
 3  2013     1     1      544            545        -1     1004
 4  2013     1     1      554            558        -4      740
 5  2013     1     1      555            600        -5      913
 6  2013     1     1      557            600        -3      709
 7  2013     1     1      557            600        -3      838
 8  2013     1     1      558            600        -2      849
 9  2013     1     1      558            600        -2      853
10  2013     1     1      558            600        -2      923
# … with 228,380 more rows, and 12 more variables:
#   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>

Combine fueleconomy::vehicles and fueleconomy::common to find only records for the most common models

fueleconomy::vehicles %>% 
  semi_join(fueleconomy::common, by = c("make", "model"))
# A tibble: 14,531 x 12
      id make  model   year class  trans drive   cyl displ fuel    hwy
   <dbl> <chr> <chr>  <dbl> <chr>  <chr> <chr> <dbl> <dbl> <chr> <dbl>
 1  1833 Acura Integ…  1986 Subco… Auto… Fron…     4   1.6 Regu…    28
 2  1834 Acura Integ…  1986 Subco… Manu… Fron…     4   1.6 Regu…    28
 3  3037 Acura Integ…  1987 Subco… Auto… Fron…     4   1.6 Regu…    28
 4  3038 Acura Integ…  1987 Subco… Manu… Fron…     4   1.6 Regu…    28
 5  4183 Acura Integ…  1988 Subco… Auto… Fron…     4   1.6 Regu…    27
 6  4184 Acura Integ…  1988 Subco… Manu… Fron…     4   1.6 Regu…    28
 7  5303 Acura Integ…  1989 Subco… Auto… Fron…     4   1.6 Regu…    27
 8  5304 Acura Integ…  1989 Subco… Manu… Fron…     4   1.6 Regu…    28
 9  6442 Acura Integ…  1990 Subco… Auto… Fron…     4   1.8 Regu…    24
10  6443 Acura Integ…  1990 Subco… Manu… Fron…     4   1.8 Regu…    26
# … with 14,521 more rows, and 1 more variable: cty <dbl>

Reference

https://r4ds.had.co.nz/

https://jrnold.github.io/r4ds-exercise-solutions/

Citation

For attribution, please cite this work as

lruolin (2021, May 17). pRactice corner: Relational Data. Retrieved from https://lruolin.github.io/myBlog/posts/20210517_Tidyverse Chap 10 - Relational Data/

BibTeX citation

@misc{lruolin2021relational,
  author = {lruolin, },
  title = {pRactice corner: Relational Data},
  url = {https://lruolin.github.io/myBlog/posts/20210517_Tidyverse Chap 10 - Relational Data/},
  year = {2021}
}